A radiomics-based decision support tool improves lung cancer diagnosis in combination with the Herder score in large lung nodules

Summary
Background Large lung nodules (≥15 mm) have the highest risk of malignancy, and may differ from their smaller counterparts in important phenotypic or clinical characteristics. Existing risk models do not stratify large nodules well. We aimed to develop and validate an integrated segmentation and classification pipeline, incorporating deep-learning and traditional radiomics, to classify large lung nodules according to cancer risk.
Methods 502 patients from five U.K. centres were recruited to the large-nodule arm of the retrospective LIBRA study between July 2020 and April 2022. 838 CT scans were used for model development, split into training and test sets (70% and 30% respectively). An nnUNet model was trained to automate lung nodule segmentation. A radiomics signature was developed to classify nodules according to malignancy risk. Performance of the radiomics model, termed the large-nodule radiomics predictive vector (LN-RPV), was compared with three radiologists and the Brock and Herder scores.
Findings 499 patients had technically evaluable scans (mean age 69 ± 11, 257 men, 242 women). In the test set of 252 scans, the nnUNet achieved a DICE score of 0.86, and the LN-RPV achieved an AUC of 0.83 (95% CI 0.77–0.88) for malignancy classification. Performance was higher than the median radiologist (AUC 0.75 [95% CI 0.70–0.81], DeLong p = 0.03). LN-RPV was robust to auto-segmentation (ICC 0.94). For baseline solid nodules in the test set (117 patients), LN-RPV had an AUC of 0.87 (95% CI 0.80–0.93) compared to 0.67 (95% CI 0.55–0.76, DeLong p = 0.002) for the Brock score and 0.83 (95% CI 0.75–0.90, DeLong p = 0.4) for the Herder score. In the international external test set (n = 151), LN-RPV maintained an AUC of 0.75 (95% CI 0.63–0.85). 18 out of 22 (82%) malignant nodules in the Herder 10–70% category in the test set were identified as high risk by the decision-support tool, and may have been referred for earlier intervention.
Interpretation The model accurately segments and classifies large lung nodules, and may improve upon existing clinical models.
Funding This project represents independent research funded by: 1) Royal Marsden Partners Cancer Alliance, 2) the Royal Marsden Cancer Charity, 3) the National Institute for Health Research (NIHR) Biomedical Research Centre at the Royal Marsden NHS Foundation Trust and The Institute of Cancer Research, London, 4) the National Institute for Health Research (NIHR) Biomedical Research Centre at Imperial College London, 5) Cancer Research UK (C309/A31316).



Introduction
Incidental lung nodules are a common finding on CT scans. Most are benign, but some represent early-stage cancers and provide an opportunity for early lung cancer diagnosis. 1 Correctly stratifying nodules is challenging, because triaging a high-risk nodule as low-risk could lead to delayed cancer diagnosis, but over-investigating low-risk nodules may expose patients to undue complications. Many guidelines have therefore been developed to support management decisions, and these incorporate nodule size as a key risk factor. [2][3][4][5][6][7] The American College of Radiology Lung-RADS screening criteria place solid nodules ≥15 mm into the highest risk category (4B), recommending consideration of biopsy. 7 A 15 mm threshold is supported by data from a study of 2821 nodules, which found that clinical risk factors for malignancy differed above and below this cut-off in multivariable regression. 8 Nevertheless, the malignancy rate in ≥15 mm/Lung-RADS 4B nodules is still variable (23.5%–36.3%), and additional non-invasive biomarkers may help to identify those most at risk. 9,10 In the United Kingdom (U.K.), the British Thoracic Society (BTS) guidelines are used to investigate incidental nodules. 3 These guidelines use a Brock score threshold of ≥10% to trigger further investigation of solid nodules, which would be met by a 50-year-old woman with a 15 mm nodule and no other risk factors, and may therefore not stratify large (15–30 mm) nodules well. 11 The BTS algorithm also incorporates the Herder score, which utilises PET data. 12 The original Herder model was developed 17 years ago in a small patient cohort, but remains a central component of nodule multidisciplinary meetings across the U.K. 3 Patients within the 10–70% Herder category have a broad range of possible clinical actions, and methods to better stratify this group are needed. Although Herder has been validated in modern datasets, it remains to be seen how machine-learning based approaches could enhance the model to improve patient stratification. 13

Research in context
Evidence before this study
The current guidelines for investigating lung nodules rely on clinical risk models, such as the Brock and Herder scores, and most nodules above 15 mm will trigger the 10% threshold for investigation. Many large nodules fall into the 10–70% Herder category, wherein the British Thoracic Society Guidelines suggest a broad range of options, from surveillance to surgery, and methods to improve stratification are needed. In the many years since the Herder model was developed, few studies have investigated how it could integrate with non-invasive radiomics models to improve early cancer diagnosis rates, and no existing studies have looked at large (15–30 mm) nodules only.

Added value of this study
This study developed a radiomics-based cancer prediction model in 15–30 mm lung nodules, which are not stratified well by existing guidelines. The developed model, termed the large-nodule radiomics predictive vector, achieved higher cancer prediction accuracy than the Brock score, and by integrating with the Herder model, would have led to early intervention in 82% of the malignant nodules with Herder scores of 10–70%. Because the model requires fewer variables than the Brock and Herder scores, it could potentially streamline the risk-classification process for clinicians in the future, particularly where PET scanning is not available or will be delayed. The use of a highly accurate deep learning segmentation pipeline means that the model is not dependent on human nodule segmentation.

Implications of all the available evidence
The large nodule radiomics model improves upon or extends existing clinical models, and integrates with the British Thoracic Society guidelines to provide net benefit in terms of early cancer intervention. Although prospective evaluation is needed, this tool may aid clinician decision making with regard to large lung nodules in the future.

The requirement for additional decision-support is particularly important following the COVID-19 pandemic, which caused disruption to diagnostic services. 14 This may be especially relevant in the U.K., where PET availability lags behind other European countries, and may not be routinely available at all centres. 15 Finally, because Brock and Herder both require a large number of clinical variables, they can be time-consuming to calculate, and non-invasive methods with fewer data points could streamline the decision-making process for clinicians.
Many radiomics models for nodule classification have been developed in recent years. [16][17][18][19][20][21][22] Baldwin et al. validated a lung nodule convolutional neural network (LN-CNN) in 1187 patients with 5-15 mm nodules, achieving an AUC of 89.6%. 21 A broad range of different size criteria have been utilised, including mixed (5-30 mm) and small nodule only (5-15 mm) cohorts, but no studies have explored the utility of radiomics in large (≥15 mm) nodules, where malignancy risk is highest but still variable. 9 Given that the aetiology and risk may differ, and that models perform best on data resembling the training cohort, we hypothesise that a 15-30 mm nodule model may be able to integrate with the Herder score to improve early diagnosis.
Through the Lung Imaging Biobank for Radiomics and AI research (LIBRA), we aimed to develop a pipeline for multi-centre radiomics research capturing real-world, heterogeneous data, and to create a radiomics algorithm to accurately classify large lung nodules according to cancer risk. Finally, we sought to develop a decision-support tool to reduce delayed cancer diagnosis rates in the broad 10–70% Herder risk group.

Ethics
Health Research Authority (HRA) and research ethics committee (REC) approval were obtained for the Lung Imaging Biobank for Radiomics and AI (LIBRA) retrospective cohort study (IRAS ID: 274775, REC reference 20/NI/0088, ClinicalTrials.gov: NCT04270799). Patient consent was not required.

Study sample
Patients were recruited between 1st July 2020 and 1st April 2022 by the clinical teams at participating centres (Fig. 1). Exclusion criteria:
• Absence of analysable scans.
Up to three CT scans (baseline, interim and the final follow-up) were included for each patient. The final diagnosis at the last scan defined the ground truth for all other scans for each patient. Only a single nodule per scan was included.
Patients were identified for the external test set from the LIDC-IDRI, LUNGx and NSCLC radiogenomics public data sets. [23][24][25] Because of the small number of eligible patients in the LUNGx and LIDC data sets, up to two nodules per patient were included, leading to a total of 151 nodules in 147 patients. Clinicodemographic data for the external test set are shown in Supplementary Table S1.

Data anonymisation and storage
CT scans were link-anonymised at local centres using DICOM Browser or centre-specific methods where required. Anonymised DICOM images and demographic data were uploaded to the LIBRA XNAT server.

Radiologist benchmarking
The 252 test-set scans were reviewed by three clinical radiologists: two were post-FRCR with over 5 years of experience (MC, EA) and one was pre-FRCR with 3 years of experience (AL). The readers were blinded to clinical data including the malignancy status, but were able to see the entire CT scan including the background lung parenchyma. Scans were rated using a 5-point scale: 1 = benign, 2 = probably benign, 3 = indeterminate, 4 = probably malignant and 5 = malignant. Receiver operating characteristic (ROC) curves were then constructed to calculate AUCs.
Images and segmentations were resampled to 1 × 1 × 2 mm voxel dimensions using cubic spline and nearest neighbour interpolation respectively. Intensity values were capped at −2000 to +2000.

Radiomics model development
Data were randomly split into training and test sets (70% and 30% respectively) using the sample.split R function, maintaining equal proportions of malignant nodules. The split was grouped by study ID to prevent data leakage when multiple scans were present for a given patient.
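The grouped, stratified split described above can be illustrated with a minimal Python sketch. This is not the authors' code (they used the sample.split function in R); the function name grouped_stratified_split, the tuple layout and the toy data are hypothetical, and the sketch assumes each patient carries a single label, consistent with the final diagnosis defining the ground truth for all of that patient's scans.

```python
import random
from collections import defaultdict


def grouped_stratified_split(scans, test_fraction=0.3, seed=0):
    """Split scans into train/test by patient, stratified by malignancy.

    scans: list of (scan_id, patient_id, is_malignant) tuples (hypothetical layout).
    Returns (train_scan_ids, test_scan_ids).
    """
    # One label per patient, so every scan of a patient lands in the same set.
    patient_label = {}
    for _, patient_id, is_malignant in scans:
        patient_label[patient_id] = is_malignant

    # Group patients by label so each class is sampled separately.
    by_label = defaultdict(list)
    for patient_id, label in sorted(patient_label.items()):
        by_label[label].append(patient_id)

    rng = random.Random(seed)
    test_patients = set()
    for patients in by_label.values():
        rng.shuffle(patients)
        n_test = round(len(patients) * test_fraction)
        test_patients.update(patients[:n_test])

    train = [s for s, p, _ in scans if p not in test_patients]
    test = [s for s, p, _ in scans if p in test_patients]
    return train, test


# Toy data: 30 patients with two scans each, 20 malignant and 10 benign.
scans = [(f"scan{i}_{j}", f"pt{i}", i % 3 != 0) for i in range(30) for j in range(2)]
train_ids, test_ids = grouped_stratified_split(scans)
```

Because the split is made at the patient level, the malignant proportion is balanced across patients and only approximately across scans when patients contribute different numbers of scans.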
Radiomic features were extracted using TexLab 2.0, developed in MATLAB 2015b (MathWorks Inc., Natick, Massachusetts, USA) using 25 HU intensity bins. TexLab initially extracts 666 features, including high-order wavelet transformations. To improve study interpretability, we removed wavelet features prior to model development. The 82 remaining features were scaled using Z-standardisation, (X − X̄)/SD, where X̄ is the feature mean and SD its standard deviation. Univariable logistic regression was performed for each feature against the cancer status, and those with p values < 0.05 (Wald test) after Benjamini-Hochberg (B-H) correction were selected for the LASSO logistic regression model. Ten-fold cross-validation was used to select the largest value of lambda giving a cross-validated error within one standard error of the minimum (lambda.1se). The weighted sum of features with non-zero coefficients yielded the large-nodule radiomics predictive vector (LN-RPV).
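As an illustration of the scaling and multiple-testing steps above, the following Python sketch implements Z-standardisation and the Benjamini-Hochberg adjustment with NumPy. It is a sketch under stated assumptions rather than the study code: the function names are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def z_standardise(x):
    """Scale each column to zero mean and unit SD (sample SD assumed)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw adjustment: p_(i) * m / i for the i-th smallest p-value.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Features whose adjusted p-value is below 0.05 would pass to the LASSO step.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
selected = adjusted < 0.05
```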
K-means clustering was used to divide the training set into low and high-risk subgroups based on the RPV (Supplementary Fig. S1). The same criteria were applied to the test set.
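For a single variable such as the RPV, two-cluster K-means reduces to a cut-point midway between the two cluster centres, which is one way the training-set criteria can be carried over to the test set. The sketch below is illustrative only: the function name, the deterministic initialisation and the toy values are hypothetical, and treating higher RPV values as the high-risk group is an assumption not stated in the text.

```python
import numpy as np


def kmeans_1d_two_groups(values, n_iter=100):
    """Two-cluster K-means on a 1D array (needs at least two distinct values).

    Returns (centres, threshold), where threshold is the midpoint of the centres.
    """
    x = np.asarray(values, dtype=float)
    centres = np.array([x.min(), x.max()])  # deterministic initialisation
    for _ in range(n_iter):
        # Assign each value to its nearest centre, then update the centres.
        labels = np.abs(x[:, None] - centres[None, :]).argmin(axis=1)
        new_centres = np.array([x[labels == k].mean() for k in (0, 1)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, centres.mean()


train_rpv = np.array([-1.0, -0.9, -1.1, 1.0, 1.2, 0.8])  # toy values
centres, threshold = kmeans_1d_two_groups(train_rpv)

# Apply the training-set cut-point to unseen scores (direction is an assumption).
test_rpv = np.array([-0.5, 0.7])
test_groups = np.where(test_rpv > threshold, "high", "low")
```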

Auto-segmentation
Scans and masks were cropped to the maximal 3D segmentation dimensions. For auto-segmentation, we used the nnUNet, a self-calibrating network that automates hyperparameter optimisation and 5-fold cross-validation. 26 Each fold was trained for 1000 epochs before hyperparameter selection. Training and test set performance were evaluated using the DICE score.
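The DICE score used to evaluate segmentation is twice the overlap between two masks divided by the sum of their sizes. A minimal NumPy sketch follows; the function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here rather than something specified in the text.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """DICE coefficient between two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total


manual = np.array([[1, 1, 0], [0, 1, 0]])     # 3 voxels
automated = np.array([[1, 0, 0], [0, 1, 1]])  # 3 voxels, 2 shared with manual
score = dice_score(manual, automated)         # 2 * 2 / (3 + 3)
```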

Clinical modelling
Variables required for Brock and Herder calculation were obtained from patient records. For the purpose of Herder calculation, patients with no recorded PET data were taken to be PET negative.
Univariable logistic regression was performed to select predictive clinical features (Wald test p < 0.001 after B-H correction). Categorical variables were converted to dummy variables prior to training, with the most common level becoming the reference standard. Multivariable logistic regression models were developed incorporating the LN-RPV and statistically significant clinical features.
For comparison of the radiomics model with the Brock and Herder scores, we used a subset of the 252 test set scans pertaining to only baseline CT images containing solid nodules (n = 117), which match the 'initial approach to solid pulmonary nodules' algorithm of the BTS guidelines.
To assess the impact of the LN-RPV, we devised a decision-support tool to assess how the model could reduce missed diagnoses or delayed treatment associated with the 10-70% Herder score category (Fig. 2). Decision support impact was modelled for solid nodules in the test set (n = 174).

Statistical analysis
Analyses were performed in RStudio (v1.3.1073) and Python (v3.7.3). Due to the exploratory nature of this work, the target recruitment size was based on expert consensus only. All p values were two-sided, and a cut-off of 0.05 was used for statistical significance. 95% confidence intervals for AUC values were obtained via bootstrapping with 1000 iterations. DeLong's test was used to compare model ROC curves. As the distribution of the LN-RPV was non-normal, we used the Kruskal-Wallis test to look for interactions between scan vendor and LN-RPV. Intra-class correlation coefficients (ICCs) were analysed using the icc R function with the following parameters: model: two-way, type: agreement, and unit: single.
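A bootstrap confidence interval for the AUC, as described above, can be sketched as follows. The text does not state which bootstrap interval was used, so the percentile method here is an assumption, as are the function names and the decision to redraw resamples that contain only one class.

```python
import numpy as np


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    y = np.asarray(labels, dtype=bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for the AUC."""
    y = np.asarray(labels, dtype=bool)
    s = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, y.size, y.size)  # resample cases with replacement
        if y[idx].all() or not y[idx].any():
            continue  # AUC is undefined without both classes; draw again
        stats.append(auc(y[idx], s[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc(y, s), lower, upper


point = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
estimate, lower, upper = bootstrap_auc_ci(
    [0, 0, 0, 1, 1, 1, 1], [0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
)
```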

Patient and scan characteristics
Overall, we recruited 502 patients, of whom 499 had evaluable scans. Of the 499, the mean age was 68.94 (± SD 10.73), with 257 male and 242 female patients. Table 1 shows the distribution of clinicodemographic features amongst the training and test sets at the scan level.
The overall proportion of benign vs malignant nodules was 37.5 vs 62.5% respectively. The data set included a mixture of solid (70.6%), subsolid (22.8%) and ground-glass opacities (6.6%), with proportions well balanced amongst the training and test sets. CT scans were acquired from five institutions and four scan vendors (GE Medical Systems, Philips, Siemens and Toshiba). 464 (55%) scans were non-contrast, and a wide mixture of soft and sharp reconstruction kernels was included (Table 2).
The LN-RPV AUC was 0.76 (95% CI 0.73–0.80) in the training set and 0.83 (95% CI 0.77–0.88) in the test set. The threshold which yielded the maximum Youden index in the training set was −0.1991184. Using this threshold, the model achieved an accuracy of 76% (95% CI 70–81%), a sensitivity of 90%, a specificity of 53%, and an F1 score of 0.83 in the test set (Fig. 4b).
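The Youden index is sensitivity + specificity − 1, and the operating threshold is the score that maximises it in the training set. The sketch below shows one way to find that threshold and then compute the reported metric types at a fixed threshold. The function names and toy data are hypothetical, and classifying scores at or above the threshold as malignant is an assumption about the direction of the LN-RPV.

```python
import numpy as np


def youden_threshold(labels, scores):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1."""
    y = np.asarray(labels, dtype=bool)
    s = np.asarray(scores, dtype=float)
    best_j, best_t = -1.0, None
    for t in np.unique(s):
        pred = s >= t  # assumption: higher score means higher malignancy risk
        sens = (pred & y).sum() / y.sum()
        spec = (~pred & ~y).sum() / (~y).sum()
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t, best_j


def classification_metrics(labels, scores, threshold):
    """Accuracy, sensitivity, specificity and F1 at a fixed threshold."""
    y = np.asarray(labels, dtype=bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    tp, fp = (pred & y).sum(), (pred & ~y).sum()
    fn, tn = (~pred & y).sum(), (~pred & ~y).sum()
    return {
        "accuracy": (tp + tn) / y.size,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


train_labels = [0, 0, 1, 1, 1]
train_scores = [0.1, 0.6, 0.5, 0.7, 0.9]
threshold, j = youden_threshold(train_labels, train_scores)
metrics = classification_metrics(train_labels, train_scores, threshold)
```

In the study the threshold was fixed on the training set and then applied unchanged to the test set; the toy example reuses one data set purely for brevity.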

Clinical modelling
A total of 14 clinical variables were assessed by univariable regression against cancer status (Table 3). Six of these features were highly significant (p < 0.001, Wald test) after correction for multiple testing, and were selected for the multivariable model: Brock score, Herder score, a history of lung disease, a history of extra-thoracic malignancy, nodule density and PET avidity. Because of the potential for collinearity between the Brock score, Herder score and PET status, we calculated the Variance Inflation Factor (VIF) for each feature. The VIF was 2.12 for the Brock score. The values for the Herder score, moderate PET avidity and intense PET avidity were 31, 25 and 19 respectively, suggesting a high level of collinearity between Herder and PET status. Therefore, PET status was removed from the model.
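The VIF for a predictor is 1/(1 − R²), where R² comes from regressing that predictor on all the others; values far above 10, such as those reported for Herder and PET avidity, indicate strong collinearity. The following NumPy sketch illustrates the calculation on a plain numeric design matrix. The function name and toy data are hypothetical, and the study's exact procedure (for example, how dummy-coded PET levels were handled) is not described in the text.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of a numeric design matrix (no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))  # diverges under perfect collinearity
    return np.array(vifs)


# Uncorrelated predictors give VIFs of 1.
independent = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
vif_independent = variance_inflation_factors(independent)

# A predictor that nearly duplicates another gives very large VIFs.
x1 = np.arange(10, dtype=float)
x2 = x1 + np.tile([0.1, -0.1], 5)
vif_collinear = variance_inflation_factors(np.column_stack([x1, x2]))
```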
The results of the multivariable analysis including clinical features and the LN-RPV are shown in Table 4. The Brock score was non-significant (p = 0.63, Wald test). The highest feature weights were LN-RPV (0.25), subsolid density (0.23) and ground glass density (0.21). Both LN-RPV and the Herder score had p values < 0.001, but the Herder score had a low weight of 0.004. In the test set, the combined clinical-radiomics model did not perform better than the LN-RPV model alone (AUC 0.82, 95% CI 0.76–0.87, DeLong p = 0.56). We also developed fusion models incorporating both the Herder score and radiomics features (Supplementary Figs. S3 and S4). Early and late fusion models were not statistically significantly better than the Herder score alone.

Auto-segmentation performance
Example test set nnUNet nodule segmentation masks are shown in Supplementary Fig. S2. The model achieved a DICE score of 0.86 (SE 0.005). To evaluate the effect of the nnUNet auto-segmentation on the LN-RPV, features were extracted using manual and automated segmentation methods for comparison in the test set (Fig. 6). There was high correlation between the manual and automated LN-RPV (r = 0.95), with an ICC of 0.94, suggesting very high concordance between the segmentation methods.
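The two-way, agreement, single-unit ICC used here corresponds to the ICC(A,1) form of McGraw and Wong, computed from the mean squares of a two-way ANOVA. The sketch below is a NumPy illustration rather than the icc R function itself; the function name and toy data are hypothetical. Rows are nodules and columns are the two segmentation methods.

```python
import numpy as np


def icc_agreement_single(ratings):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    ratings: array of shape (n_subjects, k_methods).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between methods
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


identical = icc_agreement_single([[1, 1], [2, 2], [3, 3]])
offset = icc_agreement_single([[1, 2], [2, 3], [3, 4]])
```

The second example shows why the agreement form was a stricter choice than Pearson correlation: two methods that differ by a constant offset are perfectly correlated (r = 1) but have an agreement ICC of 2/3.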

Radiomics feature robustness
There was no statistically significant interaction between scan vendor and LN-RPV (p = 0.46, Kruskal-Wallis test).

Clinical decision-support
There were 38 solid test set nodules with a Herder score <10%. The cross-tabulation of LN-RPV and Herder risk groups is shown for solid nodules in the test set with Herder scores ≥10% (n = 136) in Table 5. Of the 39 nodules with a Herder score of 10–70%, there were 22 malignant nodules (56%). 18 (82%) of these malignant nodules had a high LN-RPV, and would have been upgraded to early intervention using the decision-support tool.
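The decision-support logic and the percentages quoted above can be reproduced with a few lines of Python. This is a hedged sketch: the function names are hypothetical, the handling of the exact 10% and 70% boundaries is an assumption, and the full algorithm in Fig. 2 may contain steps not captured here.

```python
def herder_category(probability):
    """Assign a Herder probability (0 to 1) to a risk band (boundaries assumed)."""
    if probability < 0.10:
        return "<10%"
    if probability <= 0.70:
        return "10-70%"
    return ">70%"


def upgrade_to_early_intervention(herder_probability, ln_rpv_high):
    """Flag intermediate-Herder nodules that the LN-RPV places in the high-risk group."""
    return herder_category(herder_probability) == "10-70%" and ln_rpv_high


# Counts reported for solid test-set nodules.
nodules_in_band = 39    # Herder 10-70%
malignant_in_band = 22  # of which malignant
malignant_flagged = 18  # malignant with a high LN-RPV

malignancy_rate = round(100 * malignant_in_band / nodules_in_band)      # 56
flagged_fraction = round(100 * malignant_flagged / malignant_in_band)   # 82
total_solid = 38 + 136  # Herder <10% plus Herder >=10%, matching n = 174
```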

Discussion
Through the LIBRA study we have established a non-commercial, national pipeline for AI-based lung cancer early diagnosis research, incorporating heterogeneous data from multiple institutions and scan vendors. Using these data, we developed the LN-RPV, an artificial-intelligence algorithm targeted specifically at large lung nodules, where many patients fall into a middle Herder category of 10–70% with variable management options. The LN-RPV performed better than the median radiologist and can be integrated with the BTS guidelines to reduce the risk of delayed cancer treatment.
Previous studies have reported that the Brock score has good predictive utility outside of the screening setting, which was not replicated in our incidental large nodule cohort. 27 For solid nodules in the test set, we found that the Brock score was only moderately discriminative (AUC 0.67). This may support the hypothesis that it does not perform as well for large nodules, though the original model was intended for screening populations. The Herder score had better performance (AUC 0.83), but did not outperform the LN-RPV (AUC 0.87), which would have led to earlier intervention in 82% of the malignant nodules with Herder scores of 10–70%. As the BTS algorithm and Herder score are used widely across the U.K. for nodule stratification, our model has the potential to improve early cancer diagnosis and treatment by highlighting which patients are high-risk and recommending they be fast-tracked to intervention. As the LN-RPV consists of only two features, compared to the 7 values input for Herder, it could potentially streamline or automate the process of nodule risk calculation (albeit with the caveat that it requires an image-analysis pipeline). Moreover, for centres without routine access to PET, or where PET scanning will be delayed, the LN-RPV could give an earlier indication of malignancy probability. We also note the wide variability in Herder score AUCs in the literature, which likely reflects the qualitative nature of PET reporting, and may suggest Herder performance is less reliable outside of expert centres. 13,28 Through incorporation with the nnUNet model, we have minimised the model's dependency on manual segmentation, which may allow easier integration into clinical workstreams in the future.
The first feature comprising the LN-RPV is the nodule surface-to-volume ratio, defined as the surface area divided by the total volume (SNS_s2v). The second feature is the gray-level co-occurrence matrix (GLCM) correlation (GLCM_Correl). GLCMs describe counts of co-occurring voxel gray-level intensities at given angles within the image, and the correlation metric assesses the linear dependency between the gray-level values of neighbouring voxels within the GLCM. GLCM features have been used to classify benign and malignant lesions in other disease groups, including breast cancer. 29 In non-small cell lung cancer (NSCLC), GLCM features are associated with the degree of tumour immune-infiltration, PDL1 expression and patient survival. 30 Taken together, we hypothesise that the LN-RPV reflects the degree of nodule diffuseness and intra-tumoural heterogeneity, and could relate to spatial differences in tumour hypoxia or immune infiltration. 30,31
In recent years, many lung nodule radiomics studies have been published, spanning a range of nodule sizes. [32][33][34][35][36] Liu et al. developed a pre-operative radiomics nomogram using 875 patients with ≤30 mm nodules from a single centre, with a validation AUC of 0.81. 37 The final feature set consisted of 20 features, four of which were shape or first order related, with the remainder consisting of GLCM, GLRLM, NGTDM and wavelet transformation features. The surface area to volume ratio was not amongst their feature set, meaning it could be a discriminating feature specific to large nodules. However, our second selected feature, GLCM_Correlation, is common between both models, and could be an important predictor of malignancy. We believe the advantage of our two-feature model, which does not include wavelet transformations, is that it is more readily interpretable and reproducible.
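To make the two LN-RPV features more concrete, the sketch below computes a GLCM correlation for a small 2D image with one pixel offset, and the surface-to-volume ratio of an idealised sphere. It is an illustration only and not the TexLab implementation: the function names are hypothetical, and the study's discretisation (25 HU bins), 3D offsets and averaging over directions are not reproduced here.

```python
import numpy as np


def glcm_correlation(image, levels, offset=(0, 1)):
    """Correlation of a symmetric GLCM built from one pixel offset.

    image: 2D array of integer gray levels in [0, levels).
    Undefined (division by zero) if the image has a single gray level.
    """
    img = np.asarray(image, dtype=int)
    dr, dc = offset
    rows, cols = img.shape
    glcm = np.zeros((levels, levels))
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                glcm[img[r, c], img[r2, c2]] += 1
                glcm[img[r2, c2], img[r, c]] += 1  # count both directions
    p = glcm / glcm.sum()
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sd_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sd_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    return ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)


def sphere_surface_to_volume(diameter_mm):
    """Surface-to-volume ratio of a sphere, which simplifies to 3 / radius."""
    return 3.0 / (diameter_mm / 2.0)


# Horizontal neighbours always match: correlation of +1.
striped = glcm_correlation([[0, 0, 0], [1, 1, 1], [2, 2, 2]], levels=3)
# Horizontal neighbours always differ: correlation of -1.
checker = glcm_correlation([[0, 1], [1, 0]], levels=2)
# A smooth 15 mm sphere has a ratio of 0.4 per mm; a 30 mm sphere, 0.2 per mm.
ratio_15mm = sphere_surface_to_volume(15.0)
```

For a fixed size, irregular or spiculated margins increase the surface area and therefore raise the ratio above the value for a smooth sphere.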
Although the LN-RPV retained good performance in the external test set (accuracy 76%), these data were obtained from public imaging databases which may not closely match the setting in which the algorithm is intended to be used. Therefore additional external testing with large, representative datasets is required before generalisable clinical use. Prospective evaluation in a real-world nodule MDT is the next step to verify its clinical utility.
Aside from the external test set, there are some other limitations to consider. Firstly, the model does not incorporate changes in radiomics features over time, which is an area for future development. Secondly, though we have developed an auto-segmentation pipeline, a truly integrated solution whereby all preprocessing, segmentation and extraction steps are unified into a single program has not yet been developed. Thirdly, a limitation of the clinical decision-support scenario is that imputation of the PET as negative when missing could underestimate Herder score performance. And finally, we note that the LN-RPV was not statistically significantly better than the Herder score using DeLong's test. However, it has been noted by Vickers et al. that the DeLong test is conservative, and that a single test, namely multivariable regression incorporating established variables and the novel predictor, is sufficient to draw conclusions about a new model's utility. 38 As the LN-RPV retained significance in multivariable testing, and identified cancers within the established Herder category of 10-70%, we believe that meaningful conclusions can be drawn about its utility in the context of established clinical models.
In summary, the LIBRA study has provided a national pipeline for multi-centre lung nodule AI research, which has been used to develop a large nodule classification algorithm for lung cancer diagnosis. Our model appears to perform better than clinical radiologists and the Brock score, and comparably to the Herder score. The modelled decision-support scenarios suggest it could lead to earlier intervention for malignant nodules in the 10–70% Herder category, which could potentially save lives in the future.

Data sharing statement
The anonymised spreadsheets of radiomics features and clinical outcomes used to generate the LN-RPV model are deposited into the Mendeley database under the accession code https://doi.org/10.17632/rz72hs5dvg.1. The R scripts for model development are provided in notebook format at: https://github.com/dr-benjamin-hunter/LIBRA_Large_Nodules. Access to the source images/data will be considered on request to Dr. Richard Lee.
LN-RPV risk groups were designated as low or high based on K-Means clustering groups. The number of actual benign and malignant nodules are presented as (benign/malignant) for each combination of Herder and model risk groups.
The other authors report no potential conflict of interest.